An Overview of Similarity Measures for Clustering XML Documents

Authors

Giovanna Guerrini

Marco Mesiti

Ismael Sanz

Abstract

The large number and heterogeneity of XML documents on the Web require clustering techniques that group similar documents together. Documents can be grouped according to their content, their structure, and the links inside and among them. For instance, grouping documents with similar structures has interesting applications in information extraction, heterogeneous data integration, personalized content delivery, access control definition, Web site structural analysis, and the comparison of RNA secondary structures. Many approaches have been proposed for evaluating the structural and content similarity between tree-based and vector-based representations of XML documents. Link-based similarity approaches developed for Web data clustering have also been adapted for XML documents. This chapter discusses and compares the most relevant similarity measures and their use in XML document clustering.

INTRODUCTION

XML is a markup language introduced by W3C (1998) that allows one to structure documents by means of nested tagged elements. An element's tag annotates the element content with a semantic description and can be exploited to retrieve only relevant documents effectively. Thus, the document structure can be exploited for document retrieval. Moreover, through the XLink language (W3C, 2001), different types of links can be specified among XML documents. In XLink, a link is a relationship among two or more resources that can be described inside an XML document. These relationships can also be exploited to improve document retrieval.

The exponential growth of XML-structured data available on the Web has raised the need for clustering techniques for XML documents. Web data clustering (Vakali et al., 2004) is the process of grouping Web data into clusters so that similar data belong to the same cluster and dissimilar data to different clusters. The goal of organizing data in this way is to improve data availability and to speed up data access, so that Web information retrieval and content delivery are improved. Moreover, clustering similar documents together allows the development of homogeneous indexing structures and schemas that are more representative of such documents.

XML documents can also be used for annotating Web resources (such as articles, images, movies, and Web Services). For example, an image can be coupled with an XML document representing the image author and the date on which it was shot, as well as a textual description of its content or theme. A search engine can …
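
As a rough illustration of what a structural similarity between XML documents can look like, the sketch below represents each document by the set of its root-to-element tag paths and compares two documents with the Jaccard coefficient. This is only one simple set-based choice, assumed here for illustration; it is not claimed to be any specific measure surveyed in the chapter. The function names `tag_paths` and `structural_similarity` and the sample image-annotation documents are hypothetical.

```python
import xml.etree.ElementTree as ET


def tag_paths(xml_text):
    """Return the set of root-to-element tag paths in an XML document."""
    root = ET.fromstring(xml_text)
    paths = set()
    # Depth-first walk; each stack entry is (element, path of its parent).
    stack = [(root, "")]
    while stack:
        element, parent_path = stack.pop()
        path = parent_path + "/" + element.tag
        paths.add(path)
        for child in element:
            stack.append((child, path))
    return paths


def structural_similarity(xml_a, xml_b):
    """Jaccard coefficient of the two documents' tag-path sets, in [0, 1]."""
    paths_a = tag_paths(xml_a)
    paths_b = tag_paths(xml_b)
    # The union is never empty: every well-formed document has a root path.
    return len(paths_a & paths_b) / len(paths_a | paths_b)


# Hypothetical image annotations that share two of four distinct paths.
doc_1 = "<image><author>A</author><date>2006</date></image>"
doc_2 = "<image><author>B</author><theme>sea</theme></image>"
print(structural_similarity(doc_1, doc_2))
```

Because only tag paths are compared, element content and links are ignored, and repeated elements count once; a content-aware or link-aware measure would need a richer representation.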

Similar resources

Metaheuristic Clustering of Persian XML Documents Based on Structural and Content Similarity

Given the increasing number of XML documents, organizing them effectively in order to retrieve useful information is essential. One possible solution is to cluster XML documents in order to discover knowledge. A key issue in clustering XML documents is how to measure the similarity between them. Conventional clustering of text documents using a do...

An Empirical Comparison of Distance Measures for Multivariate Time Series Clustering

Multivariate time series (MTS) data are ubiquitous in science and daily life, and how to measure their similarity is a core part of MTS analyzing process. Many of the research efforts in this context have focused on proposing novel similarity measures for the underlying data. However, with the countless techniques to estimate similarity between MTS, this field suffers from a lack of comparative...

The Impact of Ontology on the Performance of Information Retrieval : A Case of

The large amount and heterogeneity of XML documents on the Web requires the development of clustering techniques to group together similar documents. Documents can be grouped together according to their content, their structure, and the links inside and among the documents. For instance, grouping together documents with similar structure has interesting applications in the context of informatio...

A Progressive Clustering Algorithm to Group the XML Data by Structural and Semantic Similarity

Since XML became popular for data representation and exchange over the Web, the distribution of XML documents has rapidly increased. Turning these documents into a more useful information utility is therefore a new challenge for the field of data mining. We present a novel clustering algorithm, PCXSS, that places heterogeneous XML documents into various groups according ...

An overview on XML similarity: Background, current trends and future directions

In recent years, XML has been established as a major means for information management, and has been broadly utilized for complex data representation (e.g. multimedia objects). Owing to an unparalleled increasing use of the XML standard, developing efficient techniques for comparing XML-based documents becomes essential in the database and information retrieval communities. In this paper, we pro...

Publication date: 2006
